Mostlygeek is a boutique software publisher focused on developer-centric tools for the emerging local AI ecosystem. Its flagship utility, llama-swap, sits between the popular llama.cpp inference engine and any front end that needs to switch between large-language-model weights without manually restarting the server. Acting as a transparent proxy, it inspects the model named in each incoming request, gracefully shuts down the currently running llama.cpp server process, launches the one configured for the requested model, and holds the request until the new model is ready, so clients never see a failed call during the swap.

Typical use cases include chat front ends that let users toggle between code-generation, creative-writing, and function-calling models on demand; research rigs that benchmark successive fine-tunes against the same prompt set; and home-lab tinkerers who run different language or vision models throughout the day without breaking their Open WebUI sessions.

Written in lean Go, llama-swap exposes a drop-in proxy on localhost that mimics the OpenAI API, so existing clients require no code changes. Configuration is handled through a single YAML file where each model's launch command, aliases, and unload behavior are declared, while built-in status endpoints report load state to orchestrators such as Docker Compose or systemd. Because the tool ships as a statically linked binary, it deploys with equal ease on Windows workstations, headless Linux boxes, and M-series Macs.

Mostlygeek's software is offered for free on get.nero.com, where downloads are delivered through trusted Windows package sources such as winget, always pulling the latest release; packages can be installed individually or batched alongside other applications.
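The YAML configuration described above is declared per model. A minimal sketch, assuming the `models`, `cmd`, `ttl`, and `aliases` keys and the `${PORT}` placeholder from llama-swap's documented schema; the model names, file paths, and flags here are illustrative, not taken from this page:

```yaml
# Illustrative llama-swap config: each entry maps a model name to the
# llama.cpp server command that serves it. Paths and flags are examples only.
models:
  coder:
    cmd: llama-server --port ${PORT} -m /models/qwen2.5-coder-7b-q4_k_m.gguf -ngl 99
    ttl: 300   # assumed key: unload after 300 s of inactivity
  writer:
    aliases:
      - creative
    cmd: llama-server --port ${PORT} -m /models/mistral-7b-instruct-q5_k_m.gguf
```

A client then selects a model simply by naming it (or one of its aliases) in the `model` field of a standard OpenAI-style request.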
Model swapping for llama.cpp
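Because the proxy mimics the OpenAI API, switching models from an existing client amounts to changing the `model` string in an otherwise unchanged request. A sketch in Python, where the proxy URL, port, and model aliases are assumptions for illustration only:

```python
import json

# Assumed default address for the llama-swap proxy; adjust to your deployment.
PROXY_URL = "http://localhost:8080/v1/chat/completions"

def chat_payload(model: str, prompt: str) -> str:
    """Build an OpenAI-compatible chat request body. Only the `model` field
    determines which backend llama-swap loads; the schema stays the same."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })

# Swapping models is just a different `model` string; llama-swap handles the
# unload/load cycle server-side and queues the request until the model is ready.
code_req = chat_payload("coder", "Write a binary search in Go.")
prose_req = chat_payload("writer", "Draft a short story opening.")
print(json.loads(code_req)["model"])   # -> coder
```

The payloads above would be POSTed to `PROXY_URL` with any standard HTTP client; no client-side library changes are needed beyond pointing the base URL at the proxy.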